An Investigation of Methods for Reducing Sampling Error
in Certain IRT Procedures*

Abstract

The sampling errors of maximum likelihood estimates of item-response
theory parameters are studied in the case where both people and item
parameters are estimated simultaneously. A check on the validity of the
standard error formulas is carried out. The effect of varying sample
size, test length, and the shape of the ability distribution is
investigated. Finally, the effect of anchor-test length on the standard
error of item parameters is studied numerically for the situation, common
in equating studies, where two groups of examinees each take a different
test form together with the same anchor test. The results encourage the
use of rectangular or bimodal ability distributions, and also the use of
very short anchor tests.


In IRT until now, the sampling variances and covariances for maximum
likelihood estimates of item parameters have usually been computed by
assuming the abilities to be known; the sampling variances and covariances
for ability estimates were computed by assuming the item parameters to be
known. In this paper, a suggested method for computing the sampling
variance-covariance matrix when all parameters are unknown (Lord and
Wingersky, 1983) will be used to try to answer various practical
questions. Section 2 presents needed additional, though not conclusive,
evidence that the new method for computing the variance-covariance matrix
yields correct results. Section 3 investigates the effect of changing the
number of items or the number or distribution of people on the standard
errors of the item parameters and of the abilities. Section 4 presents a
technique for displaying and understanding the standard errors and
sampling covariances of estimates of item parameters.

Section 5 deals with the practically important situation where we
have two tests that contain a set of items in common and these tests are
administered to two separate groups of examinees. A problem in item
banking or test equating is putting the parameter estimates for the two
tests on a common scale. One way to do this is to estimate all of the
parameters for both tests in one calibration run. When this is done, how
does the number and quality of the common items affect the standard
errors of the parameter estimates for the unique (noncommon) items?

1. Preliminaries

The three-parameter Birnbaum logistic model is used throughout. The
probability of examinee a answering item i correctly is

$$P_i(\theta_a) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i(\theta_a - b_i)]}, \tag{1}$$

where $a_i$ is the discrimination of item i, $b_i$ is the difficulty
for the item, $c_i$ is the lower asymptote of the item response
function, and $\theta_a$ is the ability for examinee a. In a typical
calibration run, poorly estimable $c_i$ are ordinarily fixed at some
common value. In this paper, however, all $c_i$ are considered unknown
and must be estimated. In treating all of the $c_i$ as unknown we are
looking at the "worst case" standard errors.
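
As a minimal sketch of model (1), the following Python function evaluates the
item response function. The function and argument names are illustrative
assumptions and are not taken from LOGIST.

```python
import math


def p3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model (1).

    theta: examinee ability; a: discrimination; b: difficulty; c: lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


# When theta equals b the logistic term is 1/2, so P = c + (1 - c) / 2.
print(p3pl(theta=0.0, a=1.0, b=0.0, c=0.17))
```

For very low abilities the function approaches the lower asymptote c, which is
why c is hard to estimate when few examinees fall in that region.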

In IRT, the origin and unit of measurement of the ability scale are
arbitrary. Until this scale is specified all parameters except the $c_i$
are unidentifiable. The origin and unit of the ability scale must be
specified in terms of (as a function of) the true parameters. If the
origin and unit of the ability scale were specified in terms of the
parameter estimates, then the true parameters would be undefined. Since
the true parameters are unknown but depend on the scale used, this means
that the scale origin and the scale unit (each defined as a function of
the true parameters) must be estimated from the data. The estimated
origin and scale unit are obviously subject to sampling errors, which
affect the accuracy of all parameter estimates. It is therefore important
to define the origin and unit each by a function of parameters that can be
estimated with good accuracy.

The scale recommended in Lord and Wingersky (1983) and used here
requires that the mean of the difficulty parameters of certain selected
items be 0 (the origin) and that the difference between two such means
for two sets of selected items be 1 (the scale unit). This scale will be
referred to as the "capital" scale: parameters on this scale will be
denoted by the capital letters $A_i$, $B_i$, $C_i$, $\Theta_a$. The "small"
scale or the "LOGIST" scale, referred to by lower-case letters, is the scale
used by the LOGIST program (Wingersky, Barton, and Lord, 1982), the
computer program used here for estimating the parameters of (1) by maximum
likelihood. LOGIST sets a truncated mean of the estimated abilities to 0
and a truncated standard deviation of the estimated abilities to 1. The
following formulas convert the parameters from the LOGIST scale to the
capital scale:

$$\Theta_a = (\theta_a - b_0)/k, \qquad k = b_1 - b_0,$$

$$A_i = k a_i,$$

$$B_i = (b_i - b_0)/k,$$

$$C_i = c_i,$$

where $b_0$ and $b_1$ are means of the $b_i$ for two selected subsets of
items. The capital scale is a linear transformation of the LOGIST scale.
The $c_i$ are not affected by the scale.
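
The conversion can be sketched in a few lines of Python. The function name and
argument order are assumptions made for illustration; the constants
$b_0 = -.305$ and $k = 0.976$ and the sample values are the ones reported with
the tables in Section 3.

```python
def to_capital_scale(theta, a, b, c, b0, k):
    """Convert LOGIST-scale values to the capital scale.

    b0 is the mean difficulty of the first selected item subset and
    k = b1 - b0 is the difference between the two subset means.
    Returns (Theta, A, B, C).
    """
    return (theta - b0) / k, k * a, (b - b0) / k, c


# Illustrative values taken from the small-scale columns of Tables 1, 2, and 4.
Theta, A, B, C = to_capital_scale(theta=-2.75, a=0.78, b=-2.01, c=0.17,
                                  b0=-0.305, k=0.976)
print(round(Theta, 2), round(A, 2), round(B, 2), C)
```

Because $A_i(\Theta_a - B_i) = a_i(\theta_a - b_i)$, the response probabilities
in (1) are unchanged by the conversion.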

2. Variance of $p_i$, the Proportion Correct

If we could prove that the maximum likelihood parameter estimates for
the Birnbaum model are consistent when all item and ability parameters are
estimated simultaneously, the sampling variance-covariance matrix
described in Lord and Wingersky (1983) would be the correct one to use.
Since consistency has not yet been proven mathematically, any results that
confirm the appropriateness of this variance-covariance matrix make one
feel more comfortable about using it.

The sampling variance of $p_i$, the proportion of examinees in the
sample who answer item i correctly, can be computed directly from
familiar standard formulas; it can also be computed with some effort from
the sampling variance-covariance matrix obtained by Lord and Wingersky
(1983). These two methods should give the same results if the
Lord-Wingersky matrix is correct.



The usual likelihood equations for $b_i$ and for $c_i$, obtained by
setting the derivative of the likelihood function equal to zero, are
(Lord, 1980, eq. 12-1 and 12-2)

$$\sum_{a=1}^{N} (u_{ia} - \hat P_{ia})(\hat P_{ia} - \hat c_i)/\hat P_{ia} = 0, \tag{2}$$

$$\sum_{a=1}^{N} (u_{ia} - \hat P_{ia})/\hat P_{ia} = 0, \tag{3}$$

where $u_{ia}$ is the score (0 or 1) of examinee a on item i, N is the
number of examinees, and a caret denotes substitution of parameter
estimates for true parameter values, so that
$\hat P_{ia} \equiv \hat P_i(\hat\theta_a)$. Multiplying (3) by $\hat c_i$,
adding to (2), and transposing gives

$$\sum_{a=1}^{N} \hat P_i(\hat\theta_a) = \sum_{a=1}^{N} u_{ia}.$$

Since

$$p_i \equiv \frac{1}{N}\sum_{a=1}^{N} u_{ia}, \tag{4}$$

we have

$$p_i = \frac{1}{N}\sum_{a=1}^{N} \hat P_i(\hat\theta_a). \tag{5}$$

From (4) and (5), we can derive two separate formulas for the variance
of $p_i$.

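For some group of examinees whose abilities are specified by the
vector $\theta \equiv \{\theta_1, \theta_2, \ldots, \theta_N\}$, we have
from (4) that

$$\operatorname{var}(p_i \mid \theta) = \frac{1}{N^2}\sum_{a=1}^{N}\sum_{a'=1}^{N} \operatorname{cov}(u_{ia}, u_{ia'} \mid \theta)$$

$$= \frac{1}{N^2}\sum_{a=1}^{N} \operatorname{var}(u_{ia} \mid \theta)$$

$$= \frac{1}{N^2}\sum_{a=1}^{N} P_i(\theta_a)\, Q_i(\theta_a), \tag{6}$$

with

$$Q_i(\theta_a) \equiv 1 - P_i(\theta_a),$$

since $\operatorname{cov}(u_{ia}, u_{ia'} \mid \theta) = 0$ when
$a \ne a'$. Similarly,

$$\operatorname{cov}(p_i, p_j \mid \theta) = 0 \qquad (i \ne j). \tag{7}$$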
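
Formula (6) is easy to evaluate directly. The sketch below is an illustration
only: the evenly spaced abilities and the item parameters are hypothetical
values, not the TOEFL-based parameters used in the paper.

```python
import math


def p3pl(theta, a, b, c):
    """Item response function (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def var_proportion_correct(thetas, a, b, c):
    """Equation (6): var(p_i | theta) = (1 / N^2) * sum over a of P_i(theta_a) Q_i(theta_a)."""
    n = len(thetas)
    total = 0.0
    for theta in thetas:
        p = p3pl(theta, a, b, c)
        total += p * (1.0 - p)
    return total / n ** 2


# Hypothetical group: 1500 abilities evenly spaced between -3 and 3.
thetas = [-3.0 + 6.0 * j / 1499 for j in range(1500)]
se = math.sqrt(var_proportion_correct(thetas, a=1.0, b=0.0, c=0.17))
print(se)
```

Since $P_i Q_i \le 1/4$, the standard error from (6) can never exceed
$1/(2\sqrt{N})$.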



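By the formula for the covariance between two sums, we have from (5)
for the same group of examinees that

$$\operatorname{var}(p_i \mid \theta) = \frac{1}{N^2}\sum_{a=1}^{N}\sum_{b=1}^{N} \operatorname{cov}[\hat P_i(\hat\theta_a), \hat P_i(\hat\theta_b) \mid \theta], \tag{8}$$

$$\operatorname{cov}(p_i, p_j \mid \theta) = \frac{1}{N^2}\sum_{a=1}^{N}\sum_{b=1}^{N} \operatorname{cov}[\hat P_i(\hat\theta_a), \hat P_j(\hat\theta_b) \mid \theta]. \tag{9}$$

The $\operatorname{cov}[\hat P_i(\hat\theta_a), \hat P_j(\hat\theta_b) \mid \theta]$
are evaluated by applying the delta method
(Kelley, 1947, pp. 524-526; Kendall and Stuart, 1969, Section 10.6) to
(1). For fixed $\theta$ (for simplicity, the notation "$\mid\theta$" is
omitted from the following formula),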

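$$
\begin{aligned}
\operatorname{cov}[\hat P_i(\hat\theta_a), \hat P_j(\hat\theta_b)] = w_{ia} w_{jb} \{\, & t_{ia} t_{jb} [\operatorname{cov}(\hat\theta_a, \hat\theta_b) - \operatorname{cov}(\hat\theta_a, \hat b_j) - \operatorname{cov}(\hat\theta_b, \hat b_i) + \operatorname{cov}(\hat b_i, \hat b_j)] \\
& + v_{ia} t_{jb} [\operatorname{cov}(\hat a_i, \hat\theta_b) - \operatorname{cov}(\hat a_i, \hat b_j)] \\
& + v_{jb} t_{ia} [\operatorname{cov}(\hat\theta_a, \hat a_j) - \operatorname{cov}(\hat b_i, \hat a_j)] + v_{ia} v_{jb} \operatorname{cov}(\hat a_i, \hat a_j) \\
& + t_{jb} [\operatorname{cov}(\hat c_i, \hat\theta_b) - \operatorname{cov}(\hat c_i, \hat b_j)]/1.7 + [v_{jb} \operatorname{cov}(\hat c_i, \hat a_j) + v_{ia} \operatorname{cov}(\hat a_i, \hat c_j)]/1.7 \\
& + t_{ia} [\operatorname{cov}(\hat\theta_a, \hat c_j) - \operatorname{cov}(\hat b_i, \hat c_j)]/1.7 + \operatorname{cov}(\hat c_i, \hat c_j)/(1.7)^2 \,\},
\end{aligned}
\tag{10}
$$

where

$$w_{ia} \equiv \frac{1.7\, Q_i(\theta_a)}{1 - c_i}, \qquad t_{ia} \equiv a_i [P_i(\theta_a) - c_i], \qquad v_{ia} \equiv (\theta_a - b_i)[P_i(\theta_a) - c_i].$$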



The standard errors for $p_i$ were calculated from (6) and again from
(8) and (10) for each of the 45 items in the test described in Section 3.
The results from the two different approaches agree to three
significant digits for each item. The $\operatorname{cov}(p_i, p_j \mid \theta)$
obtained from (9) and (10) were all of order $10^{-7}$ or less, in agreement
with (7). This gives us increased confidence in the Lord-Wingersky sampling
covariance matrix.
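
The delta method in (10) rests on the first derivatives of (1) with respect to
$\theta_a$, $a_i$, $b_i$, and $c_i$. The following sketch writes those
derivatives in the $w$, $t$, $v$ form and checks them against central finite
differences. All names and the sample parameter values are assumptions chosen
for illustration; `delta_cov` is a generic first-order covariance, not the
authors' program.

```python
import math


def p3pl(theta, a, b, c):
    """Item response function (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def gradient(theta, a, b, c):
    """Analytic partial derivatives of P with respect to (theta, a, b, c)."""
    p = p3pl(theta, a, b, c)
    w = 1.7 * (1.0 - p) / (1.0 - c)   # w = 1.7 Q / (1 - c)
    t = a * (p - c)                   # t = a (P - c)
    v = (theta - b) * (p - c)         # v = (theta - b)(P - c)
    return [w * t, w * v, -w * t, w / 1.7]


def numeric_gradient(theta, a, b, c, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    x = [theta, a, b, c]
    out = []
    for r in range(4):
        hi, lo = list(x), list(x)
        hi[r] += h
        lo[r] -= h
        out.append((p3pl(*hi) - p3pl(*lo)) / (2.0 * h))
    return out


def delta_cov(grad_1, grad_2, sigma):
    """First-order covariance g1' Sigma g2, where sigma[r][s] is the covariance
    between the r-th estimate behind grad_1 and the s-th estimate behind grad_2."""
    return sum(grad_1[r] * sigma[r][s] * grad_2[s]
               for r in range(len(grad_1)) for s in range(len(grad_2)))


analytic = gradient(0.3, 1.2, -0.5, 0.17)
numeric = numeric_gradient(0.3, 1.2, -0.5, 0.17)
print(max(abs(x - y) for x, y in zip(analytic, numeric)))
```

Expanding `delta_cov` with these two gradients term by term gives the sum in
braces in (10), multiplied by $w_{ia} w_{jb}$.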
3. Effects of Changing Number of Items, Number of Examinees, or
the Frequency Distribution of Ability

To investigate the effect of changing the number of items, the
number of examinees, or the distribution of abilities on the sampling
errors of parameter estimates, various sets of parameters were specified.
The simplest set of parameters represents the administration of a 45-item
test to 1500 examinees. The numerical values used as the true $\theta_a$ were
a spaced sample of 1500 $\theta_a$ drawn from the ability estimates obtained by
LOGIST for a regular administration of the Test of English as a Foreign
Language (TOEFL). A spaced sample of fifteen items was drawn from the
sixty TOEFL items whose parameters were estimated in the same run as the
abilities. The estimated parameters for these fifteen items were used as
the true parameters. These fifteen items were then replicated twice to get
a total of 45 items, where items 16-30 and items 31-45 have the same item
parameters as items 1-15. Note that various parameters were specified, but
no sets of artificial data were generated for this study, since sampling
variances and covariances depend only on the true parameters, not on sample
observations.

To investigate the effect of increasing the number of examinees, each
of the 1500 $\theta_a$ was repeated four times to represent the $\theta_a$ of 6000
examinees. To study the effect of increasing the number of items,
another 45 items were added exactly like the first 45 to create a 90-item
test. For a different distribution of abilities, a rectangular
distribution of 1500 $\theta_a$ between -3 and 3 was randomly generated.



Tables 1-4 give the standard errors of the parameter estimates that
would be obtained from actual data in the various situations investigated.
Only the standard errors for the fifteen unique items are given in the
tables of the standard errors for the item parameters. The abilities are
grouped into 16 intervals between -V and 3. Two of the intervals had no
examinees. N is the number of examinees and n is the number of items.
The values of both the "small" and "capital" parameters are given. The
constants to convert from the small scale to the capital scale are
$b_0 = -.305$ and $k = 0.976$.

Figure 1 contains plots corresponding to these tables. Gaps in
the curve for the $\hat B_i$ are due to some points out of the range of the
plot. The standard error for $\hat C_i$ was not plotted against $C_i$, since
most of the $C_i$ were equal, but against $B_i - 2/A_i$ instead. $B_i - 2/A_i$
is an indicator of the ability level at which the item response curve becomes
asymptotic. The higher $B_i - 2/A_i$, the better one should be able to
estimate $C_i$.

As expected, quadrupling the number of examinees halved the standard
errors of the estimated item parameters; doubling the number of items
decreased the standard errors of the estimated abilities by a factor
of $\sqrt{2}$. Quadrupling the number of examinees reduces the largest
standard errors for $\hat\theta_a$ sharply but has little effect on the smaller
standard errors; doubling the number of items has only a moderate or


                               Table 1
                    Standard Errors for $\hat A_i$

                          Bell-shaped distribution      Rectangular
Item                     n=45      n=90      n=45          n=45
No.     a_i     A_i     N=1500    N=1500    N=6000        N=1500

  1     0.99    0.96    0.234     0.192     0.117         0.178
  2     0.35    0.34    0.134     0.131     0.067         0.072
  3     1.38    1.34    0.318     0.243     0.159         0.235
  4     0.78    0.76    0.147     0.126     0.073         0.099
  5     0.42    0.41    0.100     0.106     0.050         0.055
  6     0.92    0.90    0.178     0.145     0.089         0.120
  7     0.92    0.90    0.179     0.147     0.089         0.119
  8     1.06    1.04    0.209     0.168     0.104         0.141
  9     1.34    1.31    0.262     0.205     0.131         0.180
 10     1.50    1.46    0.317     0.259     0.158         0.231
 11     0.87    0.85    0.180     0.151     0.090         0.117
 12     0.62    0.60    0.142     0.128     0.071         0.086
 13     1.09    1.06    0.234     0.197     0.117         0.153
 14     1.39    1.36    0.311     0.265     0.156         0.204
 15     1.50    1.46    0.333     0.283     0.166         0.209
                               Table 2
                    Standard Errors for $\hat B_i$

                          Bell-shaped distribution      Rectangular
Item                     n=45      n=90      n=45          n=45
No.     b_i     B_i     N=1500    N=1500    N=6000        N=1500

  1    -2.01   -1.75    0.310     0.466     0.258
  2    -1.61   -1.33    2.544     2.344     1.272         1.470
  3    -1.09   -0.80    0.353     0.259     0.177         0.242
  4    -0.77   -0.48    0.257     0.240     0.128         0.177
  5    -0.67   -0.38    0.965     0.929     0.483         0.591
  6    -0.34   -0.04    0.191     0.161     0.095         0.141
  7    -0.15    0.16    0.165     0.141     0.082         0.128
  8     0.00    0.31    0.143     0.117     0.071         0.113
  9     0.11    0.42    0.124     0.096     0.062         0.096
 10     0.26    0.58    0.110     0.092     0.055         0.097
 11     0.46    0.79    0.103     0.101     0.051         0.098
 12     0.57    0.90    0.178     0.179     0.089         0.148
 13     0.68    1.01    0.085     0.086     0.043         0.086
 14     0.90    1.23    0.082     0.080     0.041         0.076
 15     1.16    1.50    0.103     0.089     0.052         0.077
                               Table 3
                    Standard Errors for $\hat C_i$

                          Bell-shaped distribution      Rectangular
Item                     n=45      n=90      n=45          n=45
No.     c_i     C_i     N=1500    N=1500    N=6000        N=1500

  1     0.17    0.17    0.598     0.469     0.299         0.316
  2     0.17    0.17    0.715     0.629     0.358         0.409
  3     0.17    0.17    0.096     0.083                   0.045
  4     0.17    0.17    0.144     0.123     0.072
  5     0.17    0.17    0.318     0.280     0.159
  6     0.17    0.17    0.071     0.064     0.035         0.039
  7     0.17    0.17    0.059     0.054     0.029
  8     0.17    0.17    0.041     0.039     0.021         0.025
  9     0.13    0.13    0.026     0.025     0.013         0.018
 10     0.34    0.34    0.026     0.026     0.013         0.021
 11     0.17    0.17    0.039     0.038     0.020         0.025
 12     0.17    0.17    0.068     0.064     0.034         0.039
 13     0.25    0.25    0.027     0.027     0.014         0.021
 14     0.29    0.29    0.020     0.020     0.010         0.018
 15     0.18    0.13    0.015     0.015     0.007         0.015
                               Table 4
                  Standard Errors for $\hat\theta_a$

                          Bell-shaped distribution      Rectangular
                         n=45      n=90      n=45          n=45
theta_a   Theta_a       N=1500    N=1500    N=6000        N=1500

 -2.75     -2.51        2.090     1.478     1.331         1.453
 -2.25     -1.99        1.296     0.917     0.879         0.955
 -1.75     -1.48        0.861     0.609     0.621         0.669
 -1.25     -0.97        0.607     0.429     0.460         0.491
 -0.75     -0.46        0.456     0.322     0.373         0.390
 -0.25      0.06        0.349     0.247     0.309         0.317
  0.25      0.57        0.278     0.196     0.266         0.268
  0.75      1.08        0.261     0.185     0.260         0.261
  1.25      1.59        0.303     0.214     0.292         0.295
  1.75      2.11        0.422     0.298     0.394         0.401
  2.25      2.62        0.628     0.444     0.589         0.599
  2.75      3.13        0.931     0.658     0.888         0.900



[Figure 1 (panel label: n = 45, N = 1500)]
small effect on the standard errors of item parameter estimates. Note
that the effects discussed in the previous sentence cannot be investigated
at all using the usual standard error formulas, which assume either
that the item parameters are known or else that the $\theta_a$ are known.

The rectangular distribution of abilities definitely gives better
estimates of the item parameters than the bell-shaped distribution of
abilities. For $C_i$ where $B_i - 2/A_i$ is low, the rectangular distribution
gave standard errors nearly as low as the standard errors with quadruple
the number of examinees.

4. Displaying Standard Errors and Sampling Covariances

In looking at tables of standard errors it is hard to see how the
standard errors for $\hat A_i$, $\hat B_i$, and $\hat C_i$ interrelate and how
the standard errors relate to the magnitude of the parameters. A plot of the
three-dimensional asymptotic joint normal distribution of $\hat A$, $\hat B$,
and $\hat C$ would be useful but difficult to read. However, projections of the
contours of this distribution onto the three two-dimensional planes will
give a graphical representation not only of the magnitude of the standard
errors but also of the sampling correlations between the parameter
estimates. The projected contours are two-dimensional ellipses. These
plots are a refinement of a suggestion by Thomas Warm (personal
communication, 1982).

For convenience, the subscript i will now be dropped. To plot the
projection of the three-dimensional contour onto the (A,B)-plane,
only $\operatorname{var}(\hat A)$, $\operatorname{var}(\hat B)$, and
$\operatorname{cov}(\hat A, \hat B)$ are needed. The exponent of
the asymptotic bivariate normal distribution of $\hat A$ and $\hat B$ is
proportional to the right side of (11). This quadratic form is asymptotically
distributed as chi square with 2 degrees of freedom. The 95th percentile for a
chi square with 2 degrees of freedom is 5.99. Thus 95 percent of the time the
obtained $(\hat A, \hat B)$ will lie within the ellipse given by the equation

$$5.99 = \frac{1}{1 - \rho^2}\left[\frac{(\hat A - A)^2}{\operatorname{Var}(\hat A)} - \frac{2\rho(\hat A - A)(\hat B - B)}{\sqrt{\operatorname{Var}(\hat A)\operatorname{Var}(\hat B)}} + \frac{(\hat B - B)^2}{\operatorname{Var}(\hat B)}\right], \tag{11}$$

where

$$\rho = \frac{\operatorname{Cov}(\hat A, \hat B)}{\sqrt{\operatorname{Var}(\hat A)\operatorname{Var}(\hat B)}}.$$

Similar equations apply for the projections onto the (A,C)- and (B,C)-planes.
Because sampling variances are inversely proportional to N, the ellipse
plotted from (11) for a given N is identical to the 53-percent ellipse that
would be plotted for a sample size N/4.
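
The ellipse in (11) and the 53-percent remark can be checked numerically. The
sketch below is an illustration under stated assumptions: the variances and
covariance are made-up values, the function names are hypothetical, and the
chi-square quantile uses the closed form that holds for 2 degrees of freedom.

```python
import math


def chi2_quantile_2df(p):
    """Quantile of chi square with 2 degrees of freedom (CDF is 1 - exp(-x/2))."""
    return -2.0 * math.log(1.0 - p)


def quadratic_form(d_a, d_b, var_a, var_b, cov_ab):
    """Right side of (11) for deviations d_a = A_hat - A and d_b = B_hat - B."""
    rho = cov_ab / math.sqrt(var_a * var_b)
    return (d_a ** 2 / var_a
            - 2.0 * rho * d_a * d_b / math.sqrt(var_a * var_b)
            + d_b ** 2 / var_b) / (1.0 - rho ** 2)


def ellipse_points(center, var_a, var_b, cov_ab, p=0.95, n_points=120):
    """Points on the p-level ellipse, built from a Cholesky factor of the covariance."""
    r = math.sqrt(chi2_quantile_2df(p))
    l11 = math.sqrt(var_a)
    l21 = cov_ab / l11
    l22 = math.sqrt(var_b - l21 ** 2)
    points = []
    for j in range(n_points):
        angle = 2.0 * math.pi * j / n_points
        x, y = math.cos(angle), math.sin(angle)
        points.append((center[0] + r * l11 * x,
                       center[1] + r * (l21 * x + l22 * y)))
    return points


def coverage_at_smaller_n(p, factor):
    """Coverage of the same ellipse when N is divided by `factor`,
    i.e. when every variance and covariance is multiplied by `factor`."""
    return 1.0 - math.exp(-chi2_quantile_2df(p) / (2.0 * factor))


print(round(chi2_quantile_2df(0.95), 2))
print(round(coverage_at_smaller_n(0.95, 4), 2))
```

Plotting the output of `ellipse_points` for each item, with the true (A, B) as
the center, gives the kind of display described in this section.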

The following procedure was used to plot a representative set of
ellipses. A hypothetical test of 60 items was created by selecting 60 items
from an operational SAT mathematics test and treating these item parameter
estimates as the true parameters. A standard normal distribution of 1000
abilities was generated. We then created 15 new items with all combinations
of the parameters a = .5, 1.0, 1.5; b = -2, -1, 0, 1, 2; and c = .15.
Using these new items, fifteen 61-item tests were created, each containing
the 60 original items and one of the new items. The sampling
variance-covariance matrix for each of the fifteen 61-item tests was obtained.
These matrices differ only because the 61st item differs for each matrix.
Only the variances and covariances for the 61st item were used in (11) to
compute the ellipses.

The plots were made for an N of 16,000 to avoid confusing overlap of
the ellipses. These ellipses are also the 53% confidence ellipses for an N
of 4000. The left and bottom axes are labeled with the "small" scale; the
right and top axes are labeled with the "capital" scale. The standard errors
used are for parameter estimates on the capital scale. The transformation
parameters to transform from the small to the capital scale are $b_0 = .001$,
$k = 1.336$. The center of each ellipse is marked by a "+".

Figure 2 shows the ellipses on the (A,B)-plane. The plot shows
that the standard error of $\hat A$ increases with A. The standard error of
$\hat B$ increases as B approaches the extremes. The sampling correlation
between $\hat A$ and $\hat B$ is moderately or strongly positive for easy
items and moderately or strongly negative for hard items.






Figure 3 shows the projections onto the (B,C)-plane. At each value
of B there are three ellipses, which are concentric because c = C = .15
for all items. The longest ellipse along the C axis is for a = .5,
the middle ellipse is for a = 1.0, and the shortest is for a = 1.5. The
other triples of ellipses are similarly ordered on a. The standard error
of $\hat C$ is larger for easy items and moderately small for difficult items;
the standard error of $\hat C$ decreases as a increases. As a decreases, the
sampling correlation between $\hat B$ and $\hat C$ becomes strongly positive
except for hard items where C is well determined.

Figure 4 shows the projections onto the (A,C)-plane. There are five
concentric ellipses for each value of A. The ellipse with the longest
C-axis is for b = -2.0; the ellipse with the shortest C-axis is
for b = 2.0. Again $\hat C$ has large standard errors for easy items
and for items with low a's. For hard items the sampling correlation
between $\hat A$ and $\hat C$ is positive and sometimes high; for easy items,
the correlation is negative.

5. Standard Errors for Two Tests with Common Items

Suppose that each of two tests measuring the same ability is
administered to a different group of examinees. We want to use item
response theory either to put the items for both tests into a common item
pool or to equate the two tests. For either purpose it is necessary that
all the estimated parameters be on the same scale.



Unless equivalent groups of examinees are used, methods for doing this
usually require a subset of items that are common to both tests. The unique
items are the items in each test that are not common to the other test. The
item parameters for each test can then either be estimated separately in two
calibration runs or together in one calibration run. If the parameters are
estimated in two separate runs, there are two different parameter estimates
for each common item. These should be the same except for sampling error and
the arbitrary origin and unit of measurement of the ability scale. There are
several methods for determining the linear transformation necessary to
transform the item parameter estimates for both tests to the same scale. These
methods will not be described here (see Stocking and Lord, 1983). However, if
all of the items for both tests are calibrated in one run, called a concurrent
calibration, the parameters for both tests are automatically put on the same
scale and no linear transformation is necessary. This concurrent procedure is
most efficient and requires fewer assumptions than other procedures. The
concurrent procedure is the procedure studied here.

One question that arises when applying the common item method for
putting the parameters for both tests on a common scale is: How many common
items are necessary? Vale, Maurelli, Gialluca, Weiss, and Ree (1981)
investigated this problem using simulated data with 5, 15, and 25 common items
and three different shapes of the common item section test information curve:
peaked, normal, and rectangular. They also investigated many other linking
methods. For the common item method they assumed that one already had good
estimates of the parameters for the common items and required that one have
enough common and unique items to get good estimates of the abilities. They
used two estimates of the abilities, one obtained from the common items, the
other from the unique items, to determine the transformation to put the unique
items onto the common scale. They found that 15 to 25 items were necessary and
that the common item sections with a rectangular or normal information function
were better than those with a peaked information function.



Another study to determine the number of common items necessary was done
by McKinley and Reckase (1981). They compared the concurrent method and
several other methods for obtaining the linear transformations using the
two sets of item parameter estimates for the common items. A large set of
items using real data from a multidimensional achievement test covering seven
subareas was calibrated in one calibration run and these parameter estimates
were used as the criterion for determining how well the other linking
procedures put the parameter estimates for subsets of these items on a common
scale. A chain of three links was created, that is, test A was linked to
test B through one set of common items, test B to test C through another
set of common items, and test C to test D through a third set. Five sample
sizes ranging from 100 examinees to 2000 examinees were used. All four tests
were then calibrated in one run for the concurrent method for each sample. The
linking was done with 5, 15, and 25 common items. Each individual test was 50
items long including the common items. McKinley and Reckase concluded
that 5 items were not adequate, 25 items were better than 15, but 15 were
adequate for linking with the concurrent method.

Given the sampling variance-covariance matrix for all parameter estimates
in our single concurrent run when all parameters are treated as unknown, we
can investigate what effect the number of common items has on the sampling
standard errors of the unique items in both tests. Note that this problem
cannot be investigated at all with the limited sampling-error formulas
that assume that either item or ability parameters are known.



Numerical Procedures 



Suppose test 1 has a section of unique items labeled V4, and test 2 has
a section of unique items labeled Z5. Both tests have the same set of common
items labeled CO. One group of examinees, group X, took test 1; another
group of examinees, group Y, took test 2. The information matrix $\|I_{pq}\|$,
which must be inverted to get the variance-covariance matrix, has the
following structure (Lord and Wingersky, 1983):
The S submatrices ($S_{11}$ for the V4 items; $S_{22}$ for the common
items; $S_{33}$ for the Z5 items) contain 3 x 3 Fisher information matrices
for $a_i$, $b_i$, $c_i$ on the diagonal. The T submatrices are the diagonal
information matrices for the examinees: $T_{11}$ for the examinees that took
test 1; $T_{22}$ for the examinees that took test 2. The F submatrices contain
the vectors $f_{ia}$, each of which is the 3 x 1 Fisher information vector
for item i and examinee a. Note that for Group Y, this is 0 for the
V4 items; for Group X, this is 0 for Z5.

The matrix $\|I_{pq}\|$ is inverted by grouping the abilities for group X
into sixteen groups and by grouping the abilities for group Y into
another set of sixteen groups. Then the formulas for inverting a
partitioned matrix using the method described in Lord and Wingersky (1983)
are successively applied.
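
To illustrate why a partitioned inverse is attractive when the examinee block
is diagonal, here is a small sketch of the standard Schur-complement identity.
It is an assumption-laden toy, not the Lord-Wingersky procedure: it does not
group abilities, it treats a single item block S, and the matrices in the usage
lines are made-up numbers.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def item_covariance(S, F, T_diag):
    """Item block of the inverse of [[S, F], [F', T]] when T is diagonal.

    Uses the identity (S - F T^{-1} F')^{-1}; dividing by the diagonal of T
    replaces the inversion of the large examinee block.
    """
    m = len(S)
    n_examinees = len(T_diag)
    schur = [[S[r][c] - sum(F[r][a] * F[c][a] / T_diag[a] for a in range(n_examinees))
              for c in range(m)] for r in range(m)]
    return invert(schur)


# Made-up information blocks: 2 item parameters, 3 examinees.
S = [[4.0, 1.0], [1.0, 3.0]]
F = [[0.5, 0.2, 0.1], [0.3, 0.4, 0.2]]
T_diag = [2.0, 3.0, 4.0]
print(item_covariance(S, F, T_diag))
```

Only a matrix of the size of the item block has to be inverted, however many
examinees there are.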

Data and Results

To study the effect of the number of common items on the standard
errors of the parameter estimates for the unique items, we selected two
60-item SAT Mathematics tests with an additional 25-item common-item
section. The 60 unique items in the first test will be referred to as V4
and the 60 unique items in the second test will be referred to as Z5.
Estimates of all of the parameters were obtained in one concurrent LOGIST
run. These estimates were treated as true parameter values in computing
the standard errors for all 145 items.
We then doubled the length of the common item section by simply
replicating the parameters for the 25 common items. Surprisingly, the
standard errors for the 120 unique items in V4 and Z5 computed with 50
common items agreed with the standard errors computed with only 25 common
items to two decimal places. If doubling the number of common items makes
so little difference, what is the effect of halving the number of common
items? Or at the extreme, reducing the number of common items to 2?

This is really not as absurd as it sounds. Provided the common items
are not part of the test score, other than improving the estimates of the
abilities, the function of the common items is to put the parameters
for the two sets of unique items on the same metric. If the model holds,
only a linear transformation is required to convert the parameters from one
scale to another. Only 2 parameters are necessary to determine this
linear transformation. With 2 common items we are estimating four
parameters that affect the scale: the two a's influence the scale unit and
the two b's influence both the scale unit and origin. The two c's are
not affected by the scale. Consequently with 2 items we actually have
two more parameters than absolutely necessary. However, if the 2 common
items have parameter estimates with large standard errors, the scale will
be less well determined than if the estimates have small standard errors.
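
The counting argument can be made concrete with a toy calculation. This is not
the concurrent procedure studied here; it is only a sketch, under the
assumption of error-free difficulties, showing that two common items already
fix the slope and intercept of a linear scale transformation. The function name
and the second scale are hypothetical.

```python
def linear_link(b_from, b_to):
    """Solve b_to = slope * b_from + intercept from the difficulties of two common items."""
    (x1, x2), (y1, y2) = b_from, b_to
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Difficulties of two common items on one scale, and on a hypothetical second
# scale defined by b' = (b - 0.5) / 2.
b_first = (-0.10, 0.21)
b_second = tuple((b - 0.5) / 2.0 for b in b_first)
print(linear_link(b_first, b_second))
```

With estimated rather than true difficulties, any sampling error in the two
items passes straight into the slope and intercept, which is why the quality
of the two items matters.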

To study the effect of two common items on the standard errors of the
unique items, we selected 2 "good" items and 2 "bad" items from the 25
common items. The item parameters and their standard errors for
the 2 "good" items were

   $\hat a$   SE($\hat A$)   $\hat b$   SE($\hat B$)   $\hat c$   SE($\hat C$)

    .98         .09           -.10        .02            .06        .02
    .96         .10            .21        .02            .15        .02

The item parameters and their standard errors for the 2 "bad" common
items were

   $\hat a$   SE($\hat A$)   $\hat b$   SE($\hat B$)   $\hat c$   SE($\hat C$)

    .32         .10          -1.51        .47            .07        .24
    .53         .07          -1.19        .12            .07        .10


These standard errors were computed for the situation where all 25 common
items are included in the parameter estimation run.

We then obtained the variance-covariance matrix for the V4 and Z5 items
when only the 2 good common items are included in the estimation run and also
the variance-covariance matrix when only the 2 bad common items are used.
The constants to transform from the small scale to the capital scale are
$b_0 = -.261$ and $k = 1.914$. Only V4 and Z5 items were used to compute
$b_0$ and $k$ so that the same transformation would apply to all four
variance-covariance matrices.

Table 5 gives the medians, and the bottom and top quartiles, of the
standard errors for $\hat A$, $\hat B$, and $\hat C$ for the V4 and Z5 unique
items computed for four different situations: using 50 common items, using 25
common items, using 2 good common items, and using 2 bad common items. Using
2 good common items gives smaller standard errors for the unique items than
using 2 bad common items. The standard errors using the 2 good items

Table 5

Comparison of the Standard Errors of Estimated Item Parameters across
the Four Sets of Common Items

                             50         25       2 Good     2 Bad
                           Common     Common     Common     Common
                           Items      Items      Items      Items

Standard Errors for A
   First Quartile          0.114      0.115      0.123      0.131
   Median                  0.140      0.141      0.151      0.163
   Third Quartile          0.224      0.226      0.236      0.243

Standard Errors for B
   First Quartile          0.029      0.030      0.034      0.041
   Median                  0.042      0.042      0.048      0.056
   Third Quartile          0.066      0.067      0.072      0.076

Standard Errors for C
   First Quartile          0.013      0.013      0.013      0.013
   Median                  0.027      0.027      0.028      0.027
   Third Quartile          0.055      0.055      0.058      0.056




are not much larger than the standard errors using 25 common items. Even
reliance on just 2 bad common items gives surprisingly good results.
Since the purpose of the common items is to determine the scale, it is not
surprising that the number of common items has a negligible effect on the
standard error of C, since c is independent of the ability scale.

Table 6 gives the standard errors for the abilities computed with the
four different sets of common items. Not surprisingly, if we increase the
number of common items to 50 we reduce the standard error of the abilities,
although not uniformly, as shown by the ratio column. The standard errors for
the abilities at -2 were lower when computed using the two bad common items,
which were easy items, than when computed using the two good common items.
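
The ratio column of Table 6 can be checked against the tabled standard
errors. The sketch below assumes that the ratio is the standard error
with 50 common items divided by the standard error with 25 common items;
that reading is an inference from the numbers, not a definition given in
the text. Small discrepancies are expected because the tabled standard
errors are rounded to three decimals.

```python
# Standard errors of ability from Table 6, rows ordered from the
# highest ability (2.00) to the lowest (-2.00).
se_50 = [0.097, 0.089, 0.100, 0.129, 0.221]
se_25 = [0.109, 0.102, 0.115, 0.145, 0.248]
reported_ratio = [0.894, 0.870, 0.874, 0.892, 0.891]

# Assumed definition of the ratio column: SE(50 common) / SE(25 common).
ratios = [s50 / s25 for s50, s25 in zip(se_50, se_25)]

# Largest disagreement between recomputed and reported ratios.
max_gap = max(abs(r - rep) for r, rep in zip(ratios, reported_ratio))

for r, rep in zip(ratios, reported_ratio):
    print(f"recomputed {r:.3f}   reported {rep:.3f}")
print(max_gap < 0.005)
```

The ratios stay between about 0.87 and 0.89 across the ability range,
which is the sense in which the reduction is real but not uniform.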

Even though there is little difference between the standard errors when
there are 2 common items and when there are 25 common items, the parameter
estimates for the V4 and Z5 items will not have been adequately put on the
same scale if all of the parameter estimates for V4 items err in one
direction and all of the parameter estimates for Z5 items err in the
opposite direction. Is this what will happen in practice? To determine how
well an anchor test of only 2 common items puts tests V4 and Z5 on the
same scale, we reestimated the parameters twice, once in a LOGIST run with
the items for Z5 and V4 and the two "good" common items, the other in
a LOGIST run with the items for Z5 and V4 and the two "bad" common items.

The estimated parameters for Z5 and V4 computed with the 25 common 
items will be used as the criterion for evaluating the calibrations 

Table 6

Comparison of the Standard Errors of Estimated Abilities across
the Four Sets of Common Items

                      50         25                  2 Good     2 Bad
                    Common     Common                Common     Common
                    Items      Items                 Items      Items
    Θ        θ       S.E.       S.E.      Ratio       S.E.       S.E.

   2.00     1.18    0.097      0.109      0.894      0.127      0.132
   1.00     0.66    0.089      0.102      0.870      0.122      0.126
   0.00     0.14    0.100      0.115      0.874      0.134      0.138
  -1.00    -0.39    0.129      0.145      0.892      0.165      0.167
  -2.00    -0.91    0.221      0.248      0.891      0.288      0.281

with 2 common items. The 2 good common items did fairly well at putting
the parameters on this scale. The 2 bad items did not do so well.

The top plot in Figure 5 compares the b's for the 60 unique V4 items
estimated with 2 good items with the b's estimated with 25 common
items. Similarly, the bottom plot compares the b's for the unique Z5
items. If the parameters were on the same metric, the b's in both plots
should fall on a 45° line; any difference from the 45° line is hard to
distinguish. The two points that are far away from the 45° line had
their c's fixed by LOGIST in one calibration but not in the other.

Figure 6 shows the plots of the a's for V4 and Z5, respectively.
Here it definitely looks as if the a's are not on the same scale.
The a's for the V4 items fall along a line steeper than 45°.

Figure 7 compares the b's estimated with the 2 bad common items with
the b's estimated with 25 common items. Here the points for the V4
items are above the 45° line, and points for the Z5 items are below the
line. The plots comparing the a's in Figure 8 confirm that the 2 bad
common items do not put the parameters for Z5 and V4 on the same metric.
As suspected, with the 2 bad items the parameters for one set of the unique
items err in one direction and, for the other set, in the opposite direction.

The reason for putting Z5 and V4 on the same scale was to equate
Z5 to V4 using true-score equating. What effect does using only 2 common
items to put the two forms on the same scale have on the true-score equating

Figure 5. Comparison of the b's estimated with 2 good common
items and the b's estimated with 25 common items, separately for V4 and
Z5.

Figure 6. Comparison of the a's estimated with 2 good common
items and the a's estimated with 25 common items, separately for V4 and
Z5. (Axis labels: A-V4 estimated with 25 common items; A-Z5 estimated
with 25 common items.)

Figure 7. Comparison of the b's estimated with 2 bad common
items and the b's estimated with 25 common items, separately for V4 and
Z5.

Figure 8. Comparison of the a's estimated with 2 bad common
items and the a's estimated with 25 common items, separately for V4 and
Z5. (Axis label: A-Z5 estimated with 25 common items.)



between the two forms? Figure 9 shows three true-score equating lines:
the solid line is the equating line found when the parameters are estimated
using 25 common items, the dotted line is the equating line found when the
parameters are estimated using just the 2 good common items, and the third
line is found when the parameters are estimated using just the 2 bad common
items. For this equating, true scores on form Z5 are first equated to
true scores on V4. Then the true scores on V4 are converted to scaled
scores between 100 and 800 by a linear transformation. Using the equating
line with the 25 items as a criterion, the equating using 2 bad common items
is worse than the equating using 2 good common items. The equating using the
2 good common items is close to the equating with 25 common items; the
maximum scaled score difference is 8 points.
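
The first step of the equating described above, converting a true score
on one form to the equivalent true score on another, can be sketched as
follows. This is a minimal illustration of standard IRT true-score
equating, not the authors' program: it assumes the three-parameter
logistic model with D = 1.7, the item parameters are hypothetical, and
the function names are invented for the example. The final linear
conversion to the 100 to 800 scale is omitted because its constants are
not given in the text.

```python
import math

D = 1.7  # logistic scaling constant (assumed)

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Test characteristic curve: expected number-right score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(tau, items, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert the (increasing) test characteristic curve by bisection.

    tau must lie between the sum of the c's and the number of items.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate_true_score(tau_x, items_x, items_y):
    """True score on form Y equivalent to true score tau_x on form X."""
    theta = theta_for_true_score(tau_x, items_x)
    return true_score(theta, items_y)

# Hypothetical (a, b, c) triples on a common scale, for illustration only.
form_x = [(0.9, -0.5, 0.10), (1.1, 0.2, 0.15), (0.6, -1.2, 0.08)]
form_y = [(1.0, 0.0, 0.10), (0.8, 0.4, 0.12), (0.7, -1.0, 0.08)]

print(round(equate_true_score(2.0, form_x, form_y), 3))
```

Both forms must already be on a common scale for this to be meaningful,
which is why the quality of the linking carries through to the equating.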

All of these results assume that the item parameters estimated using
25 common items are on the same scale. This analysis should be repeated in a
situation where one knows that all of the parameters used as a criterion are
on a common scale. From the results so far, it appears that good linking may
be obtained with as few as five common items or fewer. However, these results
only apply when the item parameters for the two forms are put on a common
scale by estimating all of them in one calibration run. These results do not
necessarily apply when the two tests are calibrated in two separate runs and
the parameters are put on a common scale using some linear transformation
determined from the common items.

Figure 9. Comparisons of the three true-score equatings of test Z5
to test V4: using 25 common items, using 2 good common items, and using
2 bad common items.

The conclusion that good linking may be obtained with as few as five
common items is more optimistic than the conclusions reached by Vale et al.
(1981) and by McKinley and Reckase. Our differences with Vale et al.
may be due to the facts that 1) their scaling was based on estimated θ's,
and 2) they used three estimation runs instead of one concurrent run. Our
differences with McKinley and Reckase are probably due to the facts that in
their study 1) the responses of some examinees to some items (as we
understand it) often appeared twice in the same concurrent LOGIST run,
violating the assumption of local independence; and, more importantly, 2)
they pooled the Iowa Tests of Educational Development, covering seven
different achievement areas, and analyzed the resulting multidimensional
pool of items as if it were unidimensional.

Summary

The asymptotic sampling variance-covariance matrix of maximum likelihood
estimators when both abilities and item parameters are unknown was
used to study several problems in item response theory, such as the extent
to which more items, more examinees, or a different distribution of
abilities will provide better estimates of parameters. It was found for the
values of n and N studied that the standard error of θ varies
inversely as √n, but is only moderately affected by changes in N; the
standard error of the estimated item parameters varies inversely as √N
but is only slightly affected by changes in n.
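
The square-root rule stated above can be turned into a quick
back-of-the-envelope check. The sketch below is illustrative only. It
assumes that each examinee answers 60 unique items plus the anchor (the
text mentions 60 unique V4 items; the count for Z5 is not stated here), so
that moving from 25 to 50 common items raises the test length from 85 to
110. The function name is invented for the example.

```python
import math

def se_factor(size_old, size_new):
    """Multiplier on a standard error that varies inversely as sqrt(size)."""
    return math.sqrt(size_old / size_new)

# Ability standard errors: 60 unique + 25 common vs. 60 unique + 50 common.
print(round(se_factor(85, 110), 3))  # 0.879

# Item-parameter standard errors when the number of examinees N changes.
print(round(se_factor(1, 2), 3))     # doubling N:    0.707
print(round(se_factor(1, 4), 3))     # quadrupling N: 0.5
```

Under these assumptions the predicted factor of about 0.88 is close to
the ratio column of Table 6, which ranges from 0.870 to 0.894.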

A rectangular distribution of abilities gives smaller standard errors
for the item parameters than doubling the number of people. For low
A's, and also for C's for items with B - 2/A less than [illegible], the
standard errors computed with a rectangular distribution of ability were
nearly as low as the standard errors computed with a bell-shaped
distribution and quadruple the number of people.

With the variance-covariance matrix computed when all parameters are
treated as unknown, one can study the effect of the number of common items
on the standard errors of the unique items when each of two tests containing
common items is administered to a different group of examinees and the
parameters for both tests are calibrated in one LOGIST run. This problem
cannot be dealt with at all by previously available sampling variance
formulas. The number of common items has little effect on the standard
errors of the parameters for the unique items. The standard errors suggest
that as few as 2 items may be sufficient, providing the parameter estimates
for these two items are well determined. However, when two tests were
actually calibrated in one LOGIST run using 2 common items that had parameter
estimates with low standard errors, the parameters were not quite on the
same scale as the parameters estimated with 25 common items. The b's
were very close to the same scale but the a's
were on a slightly different scale. Although 2 items are not
enough, adequate linking may be possible with as few as five items.